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Abstract 



We consider a class of learning problems regularized by a structured sparsity-inducing norm de- 
fined as the sum of £2- or £oo-norms over groups of variables. Whereas much effort has been put 
in developing fast optimization techniques when the groups are disjoint or embedded in a hierar- 
chy, we address here the case of general overlapping groups. To this end, we present two different 
CO I strategies: On the one hand, we show that the proximal operator associated with a sum of £<*>- 

norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, 
allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal 
splitting techniques, and address an equivalent formulation with non-overlapping groups, but in 
higher dimension and with additional constraints. We propose efficient and scalable algorithms 
exploiting these two strategies, which are significantly faster than alternative approaches. We illus- 
\ trate these methods with several problems such as CUR matrix factorization, multi-task learning 

of tree-structured dictionaries, background subtraction in video sequences, image denoising with 
wavelets, and topographic dictionary learning of natural image patches. 

Keywords: Convex optimization, proximal methods, sparse coding, structured sparsity, matrix 
factorization, network flow optimization, alternating direction method of multipliers. 



1. Introduction 

Sparse linear models have become a popular framework for dealing with various unsupervised and 
supervised tasks in machine learning and signal processing. In such models, linear combinations of 
small sets of variables are selected to describe the data. Regularization by the ^i-norm has emerged 
as a powerful tool for addressing this variable selection problem, relying on both a well-developed 
theory (see Tibshirani, 1996; Chen et al., 1999; Mallat, 1999; Bickel et al., 2009; Wainwright, 
2009, and references therein) and efficient algorithms (Efron et al., 2004; Nesterov, 2007; Beck and 
Teboulle, 2009; Needell and Tropp, 2009; Combettes and Pesquet, 2010). 



*. These authors contributed equally. 

f . When most of this work was conducted, all authors were affiliated to INRIA, WILLOW Project-Team. 



1 



Mairal, Jenatton, Obozinski and Bach 



The £i-norm primarily encourages sparse solutions, regardless of the potential structural rela- 
tionships (e.g., spatial, temporal or hierarchical) existing between the variables. Much effort has 
recently been devoted to designing sparsity-inducing regularizations capable of encoding higher- 
order information about the patterns of non-zero coefficients (Cehver et al., 2008; Jenatton et al., 
2009; Jacob et al., 2009; Zhao et al., 2009; He and Carin, 2009; Huang et al., 2009; Baraniuk et al., 
2010; Micchelli et al, 2010), with successful applications in bioinformatics (Jacob et al., 2009; Kim 
and Xing, 2010), topic modeling (Jenatton et al., 2010a, 2011) and computer vision (Cehver et al., 
2008; Huang et al, 2009; Jenatton et al., 2010b). By considering sums of norms of appropriate 
subsets, or groups, of variables, these regularizations control the sparsity patterns of the solutions. 
The underlying optimization is usually difficult, in part because it involves nonsmooth components. 

Our first strategy uses proximal gradient methods, which have proven to be effective in this 
context, essentially because of their fast convergence rates and their ability to deal with large prob- 
lems (Nesterov, 2007; Beck and Teboulle, 2009). They can handle differentiable loss functions with 
Lipschitz-continuous gradient, and we show in this paper how to use them with a regularization 
term composed of a sum of ^-norms. The second strategy we consider exploits proximal splitting 
methods (see Combettes and Pesquet, 2008, 2010; Goldfarg and Ma, 2009; Tomioka et al., 2011; 
Qin and Goldfarb, 2011; Boyd et al., 2011, and references therein), which builds upon an equivalent 
formulation with non-overlapping groups, but in a higher dimensional space and with additional 
constraints. 1 More precisely, we make four main contributions: 

• We show that the proximal operator associated with the sum of ^-norms with overlapping 
groups can be computed efficiently and exactly by solving a quadratic min-cost flow problem, 
thereby establishing a connection with the network flow optimization literature. 2 This is the 
main contribution of the paper, which allows us to use proximal gradient methods in the 
context of structured sparsity. 

• We prove that the dual norm of the sum of ^-norms can also be evaluated efficiently, which 
enables us to compute duality gaps for the corresponding optimization problems. 

• We present proximal splitting methods for solving structured sparse regularized problems. 

• We demonstrate that our methods are relevant for various applications whose practical suc- 
cess is made possible by our algorithmic tools and efficient implementations. First, we in- 
troduce a new CUR matrix factorization technique exploiting structured sparse regulariza- 
tion, built upon the links drawn by Bien et al. (2010) between CUR decomposition (Ma- 
honey and Drineas, 2009) and sparse regularization. Then, we illustrate our algorithms with 
different tasks: video background subtraction, estimation of hierarchical structures for dic- 
tionary learning of natural image patches (Jenatton et al., 2010a, 2011), wavelet image de- 

1. The idea of using this class of algorithms for solving structured sparse problems was first suggested to us by Jean- 
Christophe Pesquet and Patrick-Louis Combettes. It was also suggested to us later by Ryota Tomioka, who briefly 
mentioned this possibility in (Tomioka et al., 201 1). It can also briefly be found in (Boyd et al., 201 1), and in details 
in the work of Qin and Goldfarb (201 1) which was conducted as the same time as ours. It was also used in a related 
context by Sprechmann et al. (2010) for solving optimization problems with hierarchical norms. 

2. Interestingly, this is not the first time that network flow optimization tools have been used to solve sparse regularized 
problems with proximal methods. Such a connection was recently established by Chambolle and Darbon (2009) in 
the context of total variation regularization, and similarly by Hoefling (2010) for the fused Lasso. One can also find 
the use of maximum flow problems for non-convex penalties in the work of Cehver et al. (2008) which combines 
Markov random fields and sparsity. 
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noising with a structured sparse prior, and topographic dictionary learning of natural image 
patches (Hyvarinen et al., 2001; Kavukcuoglu et al., 2009; Garrigues and Olshausen, 2010). 

Note that this paper extends a shorter version published in Advances in Neural Information Process- 
ing Systems (Mairal et al., 2010b), by adding new experiments (CUR matrix factorization, wavelet 
image denoising and topographic dictionary learning), presenting the proximal splitting methods, 
providing the full proofs of the optimization results, and adding numerous discussions. 

1.1 Notation 

Vectors are denoted by bold lower case letters and matrices by upper case ones. We define for q > 1 
the £ 9 -norm of a vector x in M. m as ||x|| ? = (J^L l \xi\ q ) l l q , where x, denotes the i-th coordinate of x, 
and ||x||oo = max != i m |x, | = lim^oc ||x|| ? . We also define the ^o-pseudo-norm as the number of 
nonzero elements in a vector: 3 ||x||o = #{i s.t. x,- / 0} = lim^o+^^j |x,| 9 ). We consider the 
Frobenius norm of a matrix X in W" xn : ||X|| F = (Y4L1 T!j=i ^fj) 1 / 2 , where X, ; - denotes the entry 
of X at row i and column j. Finally, for a scalar y, we denote (y) + = max(y,0). For an integer 
p > 0, we denote by 2^'-' p ^ the powerset composed of the 2 P subsets of {1, . . . ,p}. 

The rest of this paper is organized as follows: Section 2 presents structured sparse models 
and related work. Section 3 is devoted to proximal gradient algorithms, and Section 4 to proxi- 
mal splitting methods. Section 5 presents several experiments and applications demonstrating the 
effectiveness of our approach and Section 6 concludes the paper. 

2. Structured Sparse Models 

We are interested in machine learning problems where the solution is not only known beforehand 
to be sparse — that is, the solution has only a few non-zero coefficients, but also to form non-zero 
patterns with a specific structure. It is indeed possible to encode additional knowledge in the regu- 
larization other than just sparsity. For instance, one may want the non-zero patterns to be structured 
in the form of non-overlapping groups (Turlach et al., 2005; Yuan and Lin, 2006; Stojnic et al., 
2009; Obozinski et al., 2010), in a tree (Zhao et al., 2009; Bach, 2009; Jenatton et al., 2010a, 2011), 
or in overlapping groups (Jenatton et al., 2009; Jacob et al., 2009; Huang et al., 2009; Baraniuk 
et al., 2010; Cehver et al., 2008; He and Carin, 2009), which is the setting we are interested in here. 

As for classical non-structured sparse models, there are basically two lines of research, that 
either (A) deal with nonconvex and combinatorial formulations that are in general computationally 
intractable and addressed with greedy algorithms or (B) concentrate on convex relaxations solved 
with convex programming methods. 

2.1 Nonconvex Approaches 

A first approach introduced by Baraniuk et al. (2010) consists in imposing that the sparsity pattern 
of a solution (i.e., its set of non-zero coefficients) is in a predefined subset of groups of variables 
g c 2^'-' p h Given this a priori knowledge, a greedy algorithm (Needell and Tropp, 2009) is used 



3. Note that it would be more proper to write ||x||[j instead of ||x||o to be consistent with the traditional notation ||x|| 9 . 
However, for the sake of simplicity, we will keep this notation unchanged in the rest of the paper. 
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to address the following nonconvex structured sparse decomposition problem 

1 , 

m in — I |y — XwlU s.t. Supp(w) G G and ||w||o<J, 
wgRp2 iij 112 y y 11 l|U - ' 

where s is a specified sparsity level (number of nonzeros coefficients), y in R m is an observed signal, 
X is a design matrix in W nxp and Supp(w) is the support of w (set of non-zero entries). 

In a different approach motivated by the minimum description length principle (see Barron et al, 
1998), Huang et al. (2009) consider a collection of groups G C 2^'-' p \ and define a "coding length" 
for every group in Q, which in turn is used to define a coding length for every pattern in 2^ ly "' p \ 
Using this tool, they propose a regularization function cl : R p — > R such that for a vector w in R p , 
cl(w) represents the number of bits that are used for encoding w. The corresponding optimization 
problem is also addressed with a greedy procedure: 

min -lly — Xwll? s.t. cl(w) < s, 

Intuitively, this formulation encourages solutions w whose sparsity patterns have a small coding 
length, meaning in practice that they can be represented by a union of a small number of groups. 
Even though they are related, this model is different from the one of Baraniuk et al. (2010). 

These two approaches are encoding a priori knowledge on the shape of non-zero patterns that 
the solution of a regularized problem should have. A different point of view consists of modelling 
the zero patterns of the solution — that is, define groups of variables that should be encouraged to 
be set to zero together. After defining a set Q C 2^'-' p ^ of such groups of variables, the following 
penalty can naturally be used as a regularization to induce the desired property 

w \ a v - * co/ \ if there exists y eg such that w ; t^O, 

¥(w) = 2^ ^(w), with 8*(w) = < . , . (1) 
ge g 10 otherwise, 

where the r|g's are positive weights. This penalty was considered by Bach (2010), who showed that 
the convex envelope of such nonconvex functions (more precisely strictly positive, non-increasing 
submodular functions of Supp(w), see Fujishige, 2005) when restricted on the unit ^-ball, are in 
fact types of structured sparsity-inducing norms which are the topic of the next section. 

2.2 Convex Approaches with Sparsity-inducing Norms 

In this paper, we are interested in convex regularizations which induce structured sparsity. Gener- 
ally, we consider the following optimization problem 

min/(w)+A£2(w), (2) 

where / : M. p — > R is a convex function (usually an empirical risk in machine learning and a data- 
fitting term in signal processing), and Q. : M. p — > R is a structured sparsity-inducing norm, defined as 

"(w)^= IrywJ, (3) 

where Q C 2^'-' p ^ is a set of groups of variables, the vector w g in Rl g l represents the coefficients 
of w indexed by g in G, the scalars r| g are positive weights, and |.| denotes the £2- or ^-norm. We 
now consider different cases: 
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• When Q is the set of singletons — that is Q = {{1}, {2}, . . . , {p}}, and all the r| g are equal to 
one, Q. is the £i-norm, which is well known to induce sparsity. This leads for instance to the 
Lasso (Tibshirani, 1996) or equivalently to basis pursuit (Chen et al., 1999). 

• If Q is a partition of { 1 , . . . ,/?}, i.e. the groups do not overlap, variables are selected in groups 
rather than individually. When the coefficients of the solution are known to be organized in 
such a way, explicitly encoding the a priori group structure in the regularization can improve 
the prediction performance and/or interpretability of the learned models (Turlach et al, 2005; 
Yuan and Lin, 2006; Roth and Fischer, 2008; Stojnic et al, 2009; Huang and Zhang, 2010; 
Obozinski et al., 2010). Such a penalty is commonly called group-Lasso penalty. 

• When the groups overlap, £2 is still a norm and sets groups of variables to zero together (Je- 
natton et al., 2009). The latter setting has first been considered for hierarchies (Zhao et al., 
2009; Kim and Xing, 2010; Bach, 2009; Jenatton et al., 2010a, 2011), and then extended to 
general group structures (Jenatton et al., 2009). Solving Eq. (2) in this context is a challenging 
problem which is the topic of this paper. 

Note that other types of structured-sparsity inducing norms have also been introduced, notably the 
approach of Jacob et al. (2009), which penalizes the following quantity 

ft'(w) ± min £ TiJS'H s.t. w = £ and Vg, Supp(^) C g. 

This penalty, which is also a norm, can be seen as a convex relaxation of the regularization intro- 
duced by Huang et al. (2009), and encourages the sparsity pattern of the solution to be a union of a 
small number of groups. Even though both Q. and Q.' appear under the terminology of "structured 
sparsity with overlapping groups", they have in fact significantly different purposes and algorith- 
mic treatments. For example, Jacob et al. (2009) consider the problem of selecting genes in a gene 
network which can be represented as the union of a few predefined pathways in the graph (groups 
of genes), which overlap. In this case, it is natural to use the norm Cl' instead of £2. On the other 
hand, we present a matrix factorization task in Section 5.3, where the set of zero-patterns should be 
a union of groups, naturally leading to the use of Q. Dealing with Q.' is therefore relevant, but out 
of the scope of this paper. 

2.3 Convex Optimization Methods Proposed in the Literature 

Generic approaches to solve Eq. (2) mostly rely on subgradient descent schemes (see Bertsekas, 
1999), and interior-point methods (Boyd and Vandenberghe, 2004). These generic tools do not 
scale well to large problems and/or do not naturally handle sparsity (the solutions they return may 
have small values but no "true" zeros). These two points prompt the need for dedicated methods. 

To the best of our knowledge, only a few recent papers have addressed problem Eq. (2) with 
dedicated optimization procedures, and in fact, only when Q. is a linear combination of ^2-norms. In 
this setting, a first line of work deals with the non-smoothness of Q. by expressing the norm as the 
minimum over a set of smooth functions. At the cost of adding new variables (to describe the set of 
smooth functions), the problem becomes more amenable to optimization. In particular, reweighted- 
£2 schemes consist of approximating the norm Q. by successive quadratic upper bounds (Argyriou 
et al., 2008; Rakotomamonjy et al., 2008; Jenatton et al., 2010b; Micchelli et al, 2010). It is possible 
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to show for instance that 

Plugging the previous relationship into Eq. (2), the optimization can then be performed by alternat- 
ing between the updates of w and the additional variables (z g ) g eg- 4 When the norm Q. is defined as a 
linear combination of ^-norms, we are not aware of the existence of such variational formulations. 

Problem (2) has also been addressed with working-set algorithms (Bach, 2009; Jenatton et al., 
2009; Schmidt and Murphy, 2010). The main idea of these methods is to solve a sequence of 
increasingly larger subproblems of (2). Each subproblem consists of an instance of Eq. (2) reduced 
to a specific subset of variables known as the working set. As long as some predefined optimality 
conditions are not satisfied, the working set is augmented with selected inactive variables (for more 
details, see Bach et al., 2011). 

The last approach we would like to mention is that of Chen et al. (2010), who used a smoothing 
technique introduced by Nesterov (2005). A smooth approximation Cl^ of £1 is used, when £1 is 
a sum of ^2-norms, and fi is a parameter controlling the trade-off between smoothness of Cl^ and 
quality of the approximation. Then, Eq. (2) is solved with accelerated gradient techniques (Beck 
and Teboulle, 2009; Nesterov, 2007) but Q.^ is substituted to the regularization CI. Depending on 
the required precision for solving the original problem, this method provides a natural choice for 
the parameter fj, with a known convergence rate. A drawback is that it requires to choose the 
precision of the optimization beforehand. Moreover, since a ^i-norm is added to the smoothed Q.^, 
the solutions returned by the algorithm might be sparse but possibly without respecting the structure 
encoded by Q. This should be contrasted with other smoothing techniques, e.g., the reweighted-^2 
scheme we mentioned above, where the solutions are only approximately sparse. 

3. Optimization with Proximal Gradient Methods 

We address in this section the problem of solving Eq. (2) under the following assumptions: 

• / is differentiable with Lipschitz-continuous gradient. For machine learning problems, this 
hypothesis holds when / is for example the square, logistic or multi-class logistic loss (see 
Shawe-Taylor and Cristianini, 2004). 

• £2 is a sum of loo-norms. Even though the ^-norm is sometimes used in the literature (Jenatton 
et al., 2009), and is in fact used later in Section 4, the £oc-norm is piecewise linear, and we 
take advantage of this property in this work. 

To the best of our knowledge, no dedicated optimization method has been developed for this setting. 
Following Jenatton et al. (2010a, 2011) who tackled the particular case of hierarchical norms, we 
propose to use proximal gradient methods, which we now introduce. 

4. Note that such a scheme is interesting only if the optimization with respect to w is simple, which is typically the case 
with the square loss function (Bach et al., 201 1). Moreover, for this alternating scheme to be provably convergent, the 
variables (z g ) ge g have to be bounded away from zero, resulting in solutions whose entries may have small values, 
but not "true" zeros. 
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3.1 Proximal Gradient Methods 

Proximal methods have drawn increasing attention in the signal processing (e.g., Wright et al., 
2009b; Combettes and Pesquet, 2010, and numerous references therein) and the machine learn- 
ing communities (e.g., Bach et al., 2011, and references therein), especially because of their con- 
vergence rates (optimal for the class of first-order techniques) and their ability to deal with large 
nonsmooth convex problems (e.g., Nesterov, 2007; Beck and Teboulle, 2009). 

These methods are iterative procedures that can be seen as an extension of gradient-based tech- 
niques when the objective function to minimize has a nonsmooth part. The simplest version of this 
class of methods linearizes at each iteration the function / around the current estimate w, and this 
estimate is updated as the (unique by strong convexity) solution of the proximal problem, defined as: 

min /(w) + (w — w) T V/(w) + X£2(w) + ^ II w — will- 

The quadratic term keeps the update in a neighborhood where / is close to its linear approximation, 
and L > is a parameter which is a upper bound on the Lipschitz constant of V/. This problem can 
be equivalently rewritten as: 

11 X 

w^2ll # -L V/(#) - W H2 + I fi(w) ' 
Solving efficiently and exactly this problem allows to attain the fast convergence rates of proximal 
methods, i.e., reaching a precision of O(|0 in k iterations. 5 In addition, when the nonsmooth 
term £1 is not present, the previous proximal problem exactly leads to the standard gradient update 
rule. More generally, we define the proximal operator. 

Definition 1 (Proximal Operator) 

The proximal operator associated with our regularization term XQ., which we denote by Prox\Q, is 
the function that maps a vector u G W p to the unique solution of 

1 9 

min - ||u-w||, + A£i(w). (4) 

This operator was initially introduced by Moreau (1962) to generalize the projection operator onto 
a convex set. What makes proximal methods appealing to solve sparse decomposition problems is 
that this operator can often be computed in closed form. For instance, 

• When Q. is the £i-norm — that is £2(w) = ||w||i — the proximal operator is the well-known 
elementwise soft-thresholding operator, 



VjG {!,..., p}, u;i->sign(uy)(|uy| -X) + 



if \uj\ < X 

sign (uj ) ( | Uj | — X) otherwise . 



• When CI is a group-Lasso penalty with ^2-norms — that is, £2(u) = Y*geg Il u gll2> with § being 
a partition of {1, . . . ,p}, the proximal problem is separable in every group, and the solution 
is a generalization of the soft-thresholding operator to groups of variables: 



VgG g ,ug^ug-ri|, || 2 <x[u g ] 




u sll2 



if ||ug||2 < X 
otherwise, 



5. Note, however, that fast convergence rates can also be achieved while solving approximately the proximal prob- 
lem (see Schmidt et al., 201 1, for more details). 
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where ny 2 <x denotes the orthogonal projection onto the ball of the ^-norm of radius X. 

• When Q. is a group-Lasso penalty with ^-norms — that is, £2(u) = J^geg ll u gll°°» with g being 
a partition of {1, ... ,p}, the solution is a different group-thresholding operator: 

where n^n^ denotes the orthogonal projection onto the £i-ball of radius X, which can be 
solved in O(p) operations (Brucker, 1984; Maculan and de Paula, 1989). Note that when 
||% 111 <^,we have a group-thresholding effect, with u g — Il|| .u^jjug] = 0. 

• When Q. is a tree-structured sum of £2- or ^-norms as introduced by Zhao et al. (2009) — 
meaning that two groups are either disjoint or one is included in the other, the solution admits 
a closed form. Let ^ be a total order on Q such that for gi,g2 in (3, gi ^ gi if and only if 
either g\ C g2 or gi(~\g2 = 0. 6 Then, if g\ < ... ^ g\g\, and if we define Prox g as (a) the 
proximal operator u g 1— > Proxj^ ||.|| (u g ) on the subspace corresponding to group g and (b) the 
identity on the orthogonal, Jenatton et al. (2010a, 2011) showed that: 

Prox^ = Prox gm o . . . o Prox gl , (5) 

which can be computed in 0{p) operations. It also includes the sparse group Lasso (sum of 
group-Lasso penalty and £\-nom\) of Friedman et al. (2010) and Sprechmann et al. (2010). 

The first contribution of our paper is to address the case of general overlapping groups with ^-norm. 
3.2 Dual of the Proximal Operator 

We now show that, for a set Q of general overlapping groups, a convex dual of the proximal 
problem (4) can be reformulated as a quadratic min-cost flow problem. We then propose an efficient 
algorithm to solve it exactly, as well as a related algorithm to compute the dual norm of Q. We start 
by considering the dual formulation to problem (4) introduced by Jenatton et al. (2010a, 2011): 

Lemma 1 (Dual of the proximal problem, Jenatton et al., 2010a, 2011) 

Given u in W, consider the problem 

min ^||u-£^||i s.t. y g eg,\m<ht\ g and ^ = 0ifj^g, (6) 

^r/>x|<?| 2 geg 

where %= (Z, 8 ) ge g is in M. px ^^, and ^ denotes the j-th coordinate of the vector if. Then, every 
solution ^ = (£ > * 8 ) ge g of Eq. (6) satisfies w* = u — Y*gegK g > where w* is the solution of Eq. (4) 
when Q. is a weighted sum of t^-norms. 

Without loss of generality, 7 we assume from now on that the scalars uj are all non-negative, and 
we constrain the entries of ^ to be so. Such a formulation introduces /?|^| dual variables which 
can be much greater than p, the number of primal variables, but it removes the issue of overlapping 
regularization. We now associate a graph with problem (6), on which the variables for g in § 
and j in g, can be interpreted as measuring the components of a flow. 

6. For a tree-structured set Q, such an order exists. 

7. Let i;* denote a solution of Eq. (6). Optimality conditions of Eq. (6) derived in Jenatton et al. (2010a, 201 1) show that 
for all j in {1, . . . ,p}, the signs of the non-zero coefficients for g in Q are the same as the signs of the entries Uj. 
To solve Eq. (6), one can therefore flip the signs of the negative variables Uj, then solve the modified dual formulation 
(with non-negative variables), which gives the magnitude of the entries (the signs of these being known). 
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3.3 Graph Model 

Let G be a directed graph G = (V,E,s,t), where V is a set of vertices, E C V x V a set of arcs, s 
a source, and t a sink. For all arcs in E, we define a non-negative capacity constant, and as done 
classically in the network flow literature (Ahuja et al., 1993; Bertsekas, 1998), we define a flow as a 
non-negative function on arcs that satisfies capacity constraints on all arcs (the value of the flow on 
an arc is less than or equal to the arc capacity) and conservation constraints on all vertices (the sum 
of incoming flows at a vertex is equal to the sum of outgoing flows) except for the source and the 
sink. For every arc e in E, we also define a real- valued cost function, which depends on the value of 
the flow on e. We now introduce the canonical graph G associated with our optimization problem: 

Definition 2 (Canonical Graph) 

Let Q C {1, . . . ,/?} be a set of groups, and (r\ g ) ge g be positive weights. The canonical graph 
G = (V,E,s,t) is the unique graph defined as follows: 

1. V = V u UV gr , where V u is a vertex set of size p, one vertex being associated to each index 
j in {1, . . . ,p}, and V gr is a vertex set of size \ Q\, one vertex per group g in Q. We thus 
have \V\ = \ Q\ + p. For simplicity, we identify groups g in Q and indices j in {1, . . . ,/?} with 
vertices of the graph, such that one can from now on refer to "vertex j" or "vertex g". 

2. For every group g in Q, E contains an arc (s,g). These arcs have capacity Xr\ g and zero cost. 

3. For every group g in Q, and every index j in g, E contains an arc (g,j) with zero cost and 
infinite capacity. We denote by 2j* the flow on this arc. 

4. For every index j in {l,...,p}, E contains an arc (j,t) with infinite capacity and a cost 
j(uj — £,j) 2 , where £ ■ is the flow on (j,t). 

Examples of canonical graphs are given in Figures la-c for three simple group structures. The 
flows Zfj associated with G can now be identified with the variables of problem (6). Since we have 
assumed the entries of u to be non-negative, we can now reformulate Eq. (6) as 

min fhuj-lj) 2 s.t. 1= and Vg € |l^<H? and Supp(^) C g ) . 

(V) 

Indeed, 

• the only arcs with a cost are those leading to the sink, which have the form (j,t), where j is 
the index of a variable in {1, . . . ,p}. The sum of these costs is J2^ =1 ^(uy — ^) 2 , which is the 
objective function minimized in Eq. (7); 

• by flow conservation, we necessarily have ^ = £ ge ^^ in the canonical graph; 

• the only arcs with a capacity constraints are those coming out of the source, which have the 
form (s,g), where g is a group in Q. By flow conservation, the flow on an arc (s,g) is Ey eg ^ 
which should be less than for\ g by capacity constraints; 

• all other arcs have the form (gj), where g is in § and j is in g. Thus, Supp(^ g ) C g. 
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Therefore we have shown that finding a flow minimizing the sum of the costs on such a graph is 
equivalent to solving problem (6). When some groups are included in others, the canonical graph 
can be simplified to yield a graph with a smaller number of edges. Specifically, if h and g are groups 
with he g, the edges (g, j) for jG/i carrying a flow can be removed and replaced by a single edge 
(g,h) of infinite capacity and zero cost, carrying the flow ^jeh^fj- This simplification is illustrated 
in Figure Id, with a graph equivalent to the one of Figure lc. This does not change the optimal value 

— k 

of t, , which is the quantity of interest for computing the optimal primal variable w*. We present in 
Appendix A a formal definition of equivalent graphs. These simplifications are useful in practice, 
since they reduce the number of edges in the graph and improve the speed of our algorithms. 

3.4 Computation of the Proximal Operator 

Quadratic min-cost flow problems have been well studied in the operations research literature 
(Hochbaum and Hong, 1995). One of the simplest cases, where Q contains a single group as in 
Figure la, is solved by an orthogonal projection on the f i-ball of radius Xrig. It has been shown, 
both in machine learning (Duchi et al., 2008) and operations research (Hochbaum and Hong, 1995; 
Brucker, 1984), that such a projection can be computed in 0{p) operations. When the group struc- 
ture is a tree as in Figure Id, strategies developed in the two communities are also similar (Jenatton 
et al., 2010a; Hochbaum and Hong, 1995), 8 and solve the problem in O(pd) operations, where d is 
the depth of the tree. 

The general case of overlapping groups is more difficult. Hochbaum and Hong (1995) have 
shown that quadratic min-cost flow problems can be reduced to a specific parametric max-flow 
problem, for which an efficient algorithm exists (Gallo et al., 1989). 9 While this generic approach 
could be used to solve Eq. (6), we propose to use Algorithm 1 that also exploits the fact that our 
graphs have non-zero costs only on edges leading to the sink. As shown in Appendix D, it it has 
a significantly better performance in practice. This algorithm clearly shares some similarities with 
existing approaches in network flow optimization such as the simplified version of Gallo et al. (1989) 
presented by Babenko and Goldberg (2006) that uses a divide and conquer strategy. Moreover, an 
equivalent algorithm exists for minimizing convex functions over polymatroid sets (Groenevelt, 
1991). This equivalence, a priori non trivial, is uncovered through a representation of structured 
sparsity-inducing norms via submodular functions, which was recently proposed by Bach (2010). 

The intuition behind our algorithm, computeFlow (see Algorithm 1), is the following: since £, = 
Hgeg^f i s tne on ly value of interest to compute the solution of the proximal operator w = u — ^, the 
first step looks for a candidate value yfor ^ by solving the following relaxed version of problem (7): 

argmin £ huj ■■- y y ) 2 s.t. £ Yy < ^ L Tfc. ( 8 ) 

yeRP je v u L jeV u g eV gr 

The cost function here is the same as in problem (7), but the constraints are weaker: Any feasi- 
ble point of problem (7) is also feasible for problem (8). This problem can be solved in linear 
time (Brucker, 1984). Its solution, which we denote y for simplicity, provides the lower bound 
||u — YH2/2 for the optimal cost of problem (7). 

8. Note however that, while Hochbaum and Hong (1995) only consider a tree-structured sum of <L-norms, the results 
from Jenatton et al. (2010a) also apply for a sum of ^-norms. 

9. By definition, a parametric max-flow problem consists in solving, for every value of a parameter, a max-flow problem 
on a graph whose arc capacities depend on this parameter. 
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"0" 

(a) £ = {^={1,2,3}}. 





(b)g = {g = {l,2},h = {2,3}}. 




(c) £={g={l,2,3},/i={2,3}}. 



(d)£ = {g = {l}uM = {2,3}}. 



Figure 1 : Graph representation of simple proximal problems with different group structures Q. The 
three indices 1,2,3 are represented as grey squares, and the groups g,h in Q as red discs. The 
source is linked to every group g,h with respective maximum capacity Xr\ g ,Xr\i, and zero cost. Each 
variable Uj is linked to the sink t, with an infinite capacity, and with a cost cj = 5 (u ; — ^ ; -) 2 . All other 
arcs in the graph have zero cost and infinite capacity. They represent inclusion relations in-between 
groups, and between groups and variables. The graphs (c) and (d) correspond to a special case of 
tree-structured hierarchy in the sense of Jenatton et al. (2010a). Their min-cost flow problems are 
equivalent. 
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Algorithm 1 Computation of the proximal operator for overlapping groups. 

input u € W, a set of groups Q, positive weights (r\g) g eg, and X (regularization parameter). 
1: Build the initial graph Go = (Vo,Eo,s,t) as explained in Section 3.4. 
2: Compute the optimal flow: £ «— computeFlow(Vb,£b)- 
3: Return: w = u — £, (optimal solution of the proximal problem). 

Function computeFlow(V = V u UV gr ,E) 

1: Projection step: y ^ argmin^yjtuy -y,-) 2 s.t. Ljevjj < ^Lgev^g- 
2: For all nodes j in V u , set jj to be the capacity of the arc (j,t). 

3: Max-flow step: Update (^j)jev u by computing a max-flow on the graph (V,E,s,t). 

4: if 3 j € V u s.t. lj / y y then 

5: Denote by (s,V + ) and (V~,t) the two disjoint subsets of (V,s,t) separated by the minimum 
(s,t)-cut of the graph, and remove the arcs between V + and V - . Call E + and E~ the two 
remaining disjoint subsets of E corresponding to V + and V~. 

6: (t,j) jeV + <- computeFlow(V+,£' + ). 

7: (^ ; -)y Gl/ - <^ computeFlow(V~,£'~). 
8: end if 

9: Return: (^)j€V„- 



The second step tries to construct a feasible flow (^,^), satisfying additional capacity constraints 
equal to Jj on arc (j,t), and whose cost matches this lower bound; this latter problem can be cast 
as a max-flow problem (Goldberg and Tarjan, 1986). If such a flow exists, the algorithm returns 
^ = y, the cost of the flow reaches the lower bound, and is therefore optimal. If such a flow does 
not exist, we have ^ / y, the lower bound is not achievable, and we build a minimum (s,t)-cut of 
the graph (Ford and Fulkerson, 1956) defining two disjoints sets of nodes V + and V~; V + is the 
part of the graph which is reachable from the source (for every node j in V + , there exists a non- 
saturated path from s to j), whereas all paths going from s to nodes in V ~ are saturated. More details 
about these properties can be found at the beginning of Appendix B. At this point, it is possible to 
show that the value of the optimal min-cost flow on all arcs between V + and V~ is necessary zero. 
Thus, removing them yields an equivalent optimization problem, which can be decomposed into two 
independent problems of smaller sizes and solved recursively by the calls to computeFlow(y + ,E + ) 
and computeFlow(y ~ ,E~). A formal proof of correctness of Algorithm 1 and further details are 
relegated to Appendix B. 

The approach of Hochbaum and Hong (1995); Gallo et al. (1989) which recasts the quadratic 
min-cost flow problem as a parametric max-flow is guaranteed to have the same worst-case com- 
plexity as a single max-flow algorithm. However, we have experimentally observed a significant 
discrepancy between the worst case and empirical complexities for these flow problems, essentially 
because the empirical cost of each max-flow is significantly smaller than its theoretical cost. Despite 
the fact that the worst-case guarantees for our algorithm is weaker than theirs (up to a factor |V|), it 
is more adapted to the structure of our graphs and has proven to be much faster in our experiments 
(see Appendix D). 10 Some implementation details are also crucial to the efficiency of the algorithm: 



10. The best theoretical worst-case complexity of a max-flow is achieved by Goldberg and Tarjan (1986) and is 
O ( | V | \E | log (|V| 2 /|£|)). Our algorithm achieves the same worst-case complexity when the cuts are well balanced — 
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• Exploiting maximal connected components: When there exists no arc between two sub- 
sets of V, the solution can be obtained by solving two smaller optimization problems cor- 
responding to the two disjoint subgraphs. It is indeed possible to process them indepen- 
dently to solve the global min-cost flow problem. To that effect, before calling the function 

computeFlow(V,£'), we look for maximal connected components (Vi,Z?i), . . . , (V^,E N ) and 
call sequentially the procedure computeFlow(V;,E,-) for i in {1, . . . ,N}. 

• Efficient max-flow algorithm: We have implemented the "push-relabel" algorithm of Gold- 
berg and Tarjan (1986) to solve our max-flow problems, using classical heuristics that signif- 
icantly speed it up in practice; see Goldberg and Tarjan (1986) and Cherkassky and Goldberg 
(1997). We use the so-called "highest-active vertex selection rule, global and gap heuris- 
tics" (Goldberg and Tarjan, 1986; Cherkassky and Goldberg, 1997), which has a worst-case 
complexity of C(| V| 2 |£"| x / 2 ) for a graph (V,E,s,t). This algorithm leverages the concept of 
pre-flow that relaxes the definition of flow and allows vertices to have a positive excess. 

• Using flow warm-restarts: The max-flow steps in our algorithm can be initialized with any 
valid pre-flow, enabling warm-restarts. This is also a key concept in the parametric max-flow 
algorithm of Gallo et al. (1989). 

• Improved projection step: The first line of the procedure computeFlow can be replaced by 
Y^argmmylywjK-y.) 2 s .t. Zjev u Y/ < ^L g ev gr T\g and |y,-| < ^Lgsj^g- The ideais to 
build a relaxation of Eq. (7) which is closer to the original problem than the one of Eq. (8), 
but that still can be solved in linear time. The structure of the graph will indeed not allow £ ■ 
to be greater than A,^ 3 yT| g after the max-flow step. This modified projection step can still be 
computed in linear time (Brucker, 1984), and leads to better performance. 

3.5 Computation of the Dual Norm 

The dual norm CI* of CI, defined for any vector K in R p by 

£2*(k) = max z T K, 

£2(z)<l 

is a key quantity to study sparsity-inducing regularizations in many respects. For instance, dual 
norms are central in working-set algorithms (Jenatton et al., 2009; Bach et al., 2011), and arise as 
well when proving theoretical estimation or prediction guarantees (Negahban et al., 2009). 

In our context, we use it to monitor the convergence of the proximal method through a duality 
gap, hence defining a proper optimality criterion for problem (2). As a brief reminder, the duality 
gap of a minimization problem is defined as the difference between the primal and dual objective 
functions, evaluated for a feasible pair of primal/dual variables (see Section 5.5, Boyd and Van- 
denberghe, 2004). This gap serves as a certificate of (sub)optimality: if it is equal to zero, then 
the optimum is reached, and provided that strong duality holds, the converse is true as well (see 
Section 5.5, Boyd and Vandenberghe, 2004). A description of the algorithm we use in the experi- 
ments (Beck and Teboulle, 2009) along with the integration of the computation of the duality gap is 
given in Appendix C. 

that is |V + | ~ \V~ | ~ M/2, but we lose a factor |K| when it is not the case. The practical speed of such algorithms is 
however significantly different than their theoretical worst-case complexities (see Boykov and Kolmogorov, 2004). 



13 



Mairal, Jenatton, Obozinski and Bach 



We now denote by /* the Fenchel conjugate of / (Borwein and Lewis, 2006), defined by 
f* (k) = sup z [z T K — /(z)]. The duality gap for problem (2) can be derived from standard Fenchel 
duality arguments (Borwein and Lewis, 2006) and it is equal to 

/(w) +AXl(w) +/*(-k) for w,k in R p with Q*(k) < I. 

Therefore, evaluating the duality gap requires to compute efficiently £2* in order to find a feasible 
dual variable K (the gap is otherwise equal to +oo and becomes non-informative). This is equivalent 
to solving another network flow problem, based on the following variational formulation: 

Q*(K) = min x s.t. £ Z* = K, and Vg G Q, \\^% < zr\ g with & = if j £ g. (9) 

In the network problem associated with (9), the capacities on the arcs (s,g), g G Q, are set to ir\ g , 
and the capacities on the arcs (j,t), j in {1, . . . ,/?}, are fixed to Ky. Solving problem (9) amounts 
to finding the smallest value of x, such that there exists a flow saturating all the capacities K, on the 
arcs leading to the sink t. Equation (9) and Algorithm 2 are proven to be correct in Appendix B. 

Algorithm 2 Computation of the dual norm. 

input K 6 M. p , a set of groups Q, positive weights (r\ g ) ge g. 
1: Build the initial graph Go = (Vo,Eo,s,t) as explained in Section 3.5. 
2: X «- dualNorm(Vb,£b)- 
3: Return: x (value of the dual norm). 

Function dualNorm(V = V„ U V gr ,E) 

1: T«— {Y,jev u K j) / (Y,g€V r^g) an d set the capacities of arcs (s,g) to ir\ g for all g in V gr . 

2: Max-flow step: Update (t,j)j €Vu by computing a max-flow on the graph (V,E,s,t). 

3: if 3 j € V u s.t. \j / Kj then 

4: Define (V + ,E + ) and (y - ,^ - ) as in Algorithm 1, and set x <- dualNorm(y^,£' _ ). 

5: end if 

6: Return: x. 



4. Optimization with Proximal Splitting Methods 

We now present proximal splitting algorithms (see Combettes and Pesquet, 2008, 2010; Tomioka 
et al., 2011; Boyd et al., 2011, and references therein) for solving Eq. (2). Differentiability of / is 
not required here and the regularization function can either be a sum of £2- or ^-norms. However, 
we assume that: 

(A) either / can be written /(w) = YJ}=\ fi(' w )-> where the functions fi are such that prox^ can be 
obtained in closed form for all y > and all i — that is, for all u in W", the following problems 
admit closed form solutions: min vG iRm — v||| + y^(v). 

(B) or / can be written /(w) = /(Xw) for all w in R p , where X in R nxp is a design matrix, and 
one knows how to efficiently compute prox ? for all y > 0. 
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It is easy to show that this condition is satisfied for the square and hinge loss functions, making it 
possible to build linear SVMs with a structured sparse regularization. These assumptions are not 
the same as the ones of Section 3, and the scope of the problems addressed is therefore slightly dif- 
ferent. Proximal splitting methods seem indeed to offer more flexibility regarding the regularization 
function, since they can deal with sums of ^-norms. 11 However, proximal gradient methods, as 
presented in Section 3, enjoy a few advantages over proximal splitting methods, namely: automatic 
parameter tuning with line-search schemes (Nesterov, 2007), known convergence rates (Nesterov, 
2007; Beck and Teboulle, 2009), and ability to provide sparse solutions (approximate solutions 
obtained with proximal splitting methods often have small values, but not "true" zeros). 

4.1 Algorithms 

We consider a class of algorithms which leverage the concept of variable splitting (see Combettes 
and Pesquet, 2010; Bertsekas and Tsitsiklis, 1989; Tomioka et al., 2011). The key is to introduce 
additional variables z g in R^l, one for every group g in Q, and equivalently reformulate Eq. (2) as 

mm f(w) + X £ r,Jz*|| s.t. Vg eg, = w g , (10) 

zSeR |sl for geg ge ^ 

The issue of overlapping groups is removed, but new constraints are added, and as in Section 3, the 
method introduces additional variables which induce a memory cost of 0(Y, ge g \g\). 

To solve this problem, it is possible to use the so-called alternating direction method of multi- 
pliers (ADMM) (see Combettes and Pesquet, 2010; Bertsekas and Tsitsiklis, 1989; Tomioka et al., 
2011; Boyd et al., 2011). 12 It introduces dual variables V s in IR^I for all g in Q, and defines the 
augmented Lagrangian: 

L(w, (z*) geg , (vt) geg ) 4 /(w) + £ + v* T (z* " w,) + ~ wjl] , 

geg 

where y > is a parameter. It is easy to show that solving Eq. (10) amounts to finding a saddle- 
point of the augmented Lagrangian. 13 The ADMM algorithm finds such a saddle-point by iterating 
between the minimization of L with respect to each primal variable, keeping the other ones fixed, 
and gradient ascent steps with respect to the dual variables. More precisely, it can be summarized as: 

1 . Minimize L with respect to w, keeping the other variables fixed. 

11. We are not aware of any efficient algorithm providing the exact solution of the proximal operator associated to a sum 
of £2-norms, which would be necessary for using (accelerated) proximal gradient methods. An iterative algorithm 
could possibly be used to compute it approximately (e.g., see Jenatton et al., 2010a, 2011), but such a procedure 
would be computationally expensive and would require to be able to deal with approximate computations of the 
proximal operators (e.g., see Combettes and Pesquet, 2010; Schmidt et al., 2011, and discussions therein). We have 
chosen not to consider this possibility in this paper. 

12. This method is used by Sprechmann et al. (2010) for computing the proximal operator associated to hierarchical 
norms, and independently in the same context as ours by Boyd et al. (201 1) and Qin and Goldfarb (201 1). 

13. The augmented Lagrangian is in fact the classical Lagrangian (see Boyd and Vandenberghe, 2004) of the following 
optimization problem which is equivalent to Eq. (10): 

min /(w) + X£^||z«|| + |||z«-w g ||^ s.t. Vg € Q, z«^w s . 

weRf,(zsgRlsl) seff ~g 2 
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2. Minimize L with respect to the z g 's, keeping the other variables fixed. The solution can be 
obtained in closed form: for all g in Q, z g «— prox^ [w g — ^V g ]. 

y 1 1 " 1 1 ' 

3. Take a gradient ascent step on L with respect to the v g, s: v g «— v g +y(z g — w g ). 

4. Go back to step 1. 

Such a procedure is guaranteed to converge to the desired solution for all value of y > (however, 
tuning yean greatly influence the convergence speed), but solving efficiently step 1 can be difficult. 
To cope with this issue, we propose two variations exploiting assumptions (A) and (B). 

4.1.1 Splitting the Loss Function / 

We assume condition (A) — that is, we have /(w) = £" =1 fi(vf). For example, when / is the square 
loss function /(w) = ^ ||y — XwHj, where X in M. nxp is a design matrix and y is in W, we would 
define for all i in {1, ... ,n} the functions /; : R — > R such that fi(v/) = ^(y; — x^w) 2 , where x; is 
the i-th row of X. 

We now introduce new variables v' in W for i = l,...,n, and replace /(w) in Eq. (10) by 
E?=i^'( vi )' tne additional constraints that v' = w. The resulting equivalent optimization prob- 
lem can now be tackled using the ADMM algorithm, following the same methodology presented 
above. It is easy to show that every step can be obtained efficiently, as long as one knows how to 
compute the proximal operator associated to the functions f t in closed form. This is in fact the case 
for the square and hinge loss functions, where n is the number of training points. The main problem 
of this strategy is the possible high memory usage it requires when n is large. 

4.1.2 Dealing with the Design Matrix 

If we assume condition (B), another possibility consists of introducing a new variable v in R", such 
that one can replace the function /(w) = /(Xw) by f(\) in Eq. (10) with the additional constraint 
v = Xw. Using directly the ADMM algorithm to solve the corresponding problem implies adding 
a term k t (v — Xw) + |||v — Xw||2 to the augmented Lagrangian L, where K is a new dual vari- 
able. The minimization of L with respect to v is now obtained by v ^— proxi ?[Xw — k], which is 

Y 

easy to compute according to (B). However, the design matrix X in the quadratic term makes the 
minimization of L with respect to w more difficult. To overcome this issue, we adopt a strategy 
presented by Zhang et al. (201 1), which replaces at iteration k the quadratic term | ||v — Xw||2 in the 
augmented Lagrangian by an additional proximity term: |||v — XwHj + \ ||w — w*||q, where is 
the current estimate of w, and ||w — w*||q = (w — w^) T Q(w — w*), where Q is a symmetric posi- 
tive definite matrix. By choosing Q = 81 — X T X, with 8 large enough, minimizing L with respect 
to w becomes simple, while convergence to the solution is still ensured. More details can be found 
in Zhang et al. (2011). 

5. Applications and Experiments 

In this section, we present various experiments demonstrating the applicability and the benefits of 
our methods for solving large-scale sparse and structured regularized problems. 
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5.1 Speed Benchmark 

We consider a structured sparse decomposition problem with overlapping groups of ^-norms, and 
compare the proximal gradient algorithm FISTA (Beck and Teboulle, 2009) with our proximal op- 
erator presented in Section 3 (referred to as ProxFlow), two variants of proximal splitting methods, 
(ADMM) and (Lin-ADMM) respectively presented in Section 4.1.1 and 4.1.2, and two generic 
optimization techniques, namely a subgradient descent (SG) and an interior point method, 14 on a 
regularized linear regression problem. SG, ProxFlow, ADMM and Lin-ADMM are implemented 
in C++. 15 Experiments are run on a single-core 2.8 GHz CPU. We consider a design matrix X in 
l" xp built from overcomplete dictionaries of discrete cosine transforms (DCT), which are naturally 
organized on one- or two-dimensional grids and display local correlations. The following families 
of groups Q using this spatial information are thus considered: (1) every contiguous sequence of 
length 3 for the one-dimensional case, and (2) every 3 x 3-square in the two-dimensional setting. We 
generate vectors y in MP according to the linear model y = Xwo + £, where £ ~ £^(0,0.01 ||Xwo H2)- 
The vector wo has about 20% percent nonzero components, randomly selected, while respecting the 
structure of Q, and uniformly generated in [—1,1]. 

In our experiments, the regularization parameter \ is chosen to achieve the same level of spar- 
sity (20%). For SG, ADMM and Lin-ADMM, some parameters are optimized to provide the low- 
est value of the objective function after 1000 iterations of the respective algorithms. For SG, 
we take the step size to be equal to a/{k + b), where k is the iteration number, and (a,b) are 
the pair of parameters selected in {10~ 3 , . . . , 10} x {10 2 , 10 3 , 10 4 }. Note that a step size of the 
form a/{\/t + b) is also commonly used in subgradient descent algorithms. In the context of hi- 
erarchical norms, both choices have led to similar results (Jenatton et al, 2011). The parameter y 
for ADMM is selected in {10~ 2 , . . . , 10 2 }. The parameters (y,8) for Lin-ADMM are selected in 
{10~ 2 , . . . , 10 2 } x {10 _1 , . . . , 10 8 }. For interior point methods, since problem (2) can be cast either 
as a quadratic (QP) or as a conic program (CP), we show in Figure 2 the results for both formu- 
lations. On three problems of different sizes, with (n,p) G {(100, 10 3 ), (1024, 10 4 ), (1024, 10 5 )}, 
our algorithms ProxFlow, ADMM and Lin-ADMM compare favorably with the other methods, (see 
Figure 2), except for ADMM in the large-scale setting which yields an objective function value 
similar to that of SG after 10 4 seconds. Among ProxFlow, ADMM and Lin-ADMM, ProxFlow 
is consistently better than Lin-ADMM, which is itself better than ADMM. Note that for the small 
scale problem, the performance of ProxFlow and Lin-ADMM is similar. In addition, note that QP, 
CP, SG, ADMM and Lin-ADMM do not obtain sparse solutions, whereas ProxFlow does. 16 

5.2 Wavelet Denoising with Structured Sparsity 

We now illustrate the results of Section 3, where a single large-scale proximal operator (p « 250000) 
associated to a sum of -norms has to be computed. We choose an image denoising task with 
an orthonormal wavelet basis, following an experiment similar to one proposed in Jenatton et al. 
(2011). Specifically, we consider the following formulation 

min i||y-Xw||? + AX2(w), (11) 

weRp 2 

14. In our simulations, we use the commercial software Mosek, http : //www.mosek . com/ 

15. Our implementation of ProxFlow is available at http: //www.di .ens . fr /willow/ SPAMS / . 

16. To reduce the computational cost of this experiment, the curves reported are the results of one single run. Similar 
types of experiments with several runs have shown very small variability (Bach et al., 201 1). 
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n=100, p=1000, one-dimensional DCT n=1024, p=10000, one-dimensional DCT n=1024, p=1 00000, one-dimensional DCT 




log(Seconds) log(Seconds) log(Seconds) 



Figure 2: Speed comparisons: distance to the optimal primal value versus CPU time (log-log scale). 
Due to the computational burden, QP and CP could not be run on every problem. 

where y in W p is a noisy input image, w represents wavelets coefficients, X in W pxp is an orthonor- 
mal wavelet basis, Xw is the estimate of the denoised image, and £2 is a sparsity-inducing norm. 
Since here the basis is orthonormal, solving the decomposition problem boils down to computing 
w* = prox^[X T y]. This makes of Algorithm 1 a good candidate to solve it when Q. is a sum of 
^co-norms. We compare the following candidates for the sparsity-inducing norms O: 

• the ^i-norm, leading to the wavelet soft-thresholding of Donoho and Johnstone (1995). 

• a sum of £oo-norms with a hierarchical group structure adapted to the wavelet coefficients, 
as proposed in Jenatton et al. (2011). Considering a natural quad-tree for wavelet coeffi- 
cients (see Mallat, 1999), this norm takes the form of Eq. (3) with one group per wavelet 
coefficient that contains the coefficient and all its descendants in the tree. We call this norm 

^tree ■ 

• a sum of £oo-norms with overlapping groups representing 2x2 spatial neighborhoods in the 
wavelet domain. This regularization encourages neighboring wavelet coefficients to be set 
to zero together, which was also exploited in the past in block-thresholding approaches for 
wavelet denoising (Cai, 1999). We call this norm £2 gr id. 

We consider Daubechies3 wavelets (see Mallat, 1999) for the matrix X, use 12 classical standard 
test images, 17 and generate noisy versions of them corrupted by a white Gaussian noise of vari- 
ance a 2 . For each image, we test several values of X = 2*0\/logp, with i taken in the range 
{— 15, — 14, . . . , 15}. We then keep the parameter A. giving the best reconstruction error on average 
on the 12 images. The factor G^/log p is a classical heuristic for choosing a reasonable regulariza- 
tion parameter (see Mallat, 1999). We provide reconstruction results in terms of PSNR in Table 1 . 18 
Unlike Jenatton et al. (201 1), who set all the weights r\ g in D. equal to one, we tried exponential 
weights of the form r\ g = p k , with k being the depth of the group in the wavelet tree, and p is taken 
in {0.25,0.5, 1,2,4}. As for A-, the value providing the best reconstruction is kept. The wavelet 
transforms in our experiments are computed with the matlabPyrTools software. 19 Interestingly, we 
observe in Table 1 that the results obtained with Q. g nd are significantly better than those obtained 

17. These images are used in classical image denoising benchmarks. See Mairal et al. (2009). 

18. Denoting by MSE the mean-squared-error for images whose intensities are between and 255, the PSNR is defined 
as PSNR = 101og 10 (255 2 /MSE) and is measured in dB. A gain of ldB reduces the MSE by approximately 20%. 

19. http : / /www. ens . nyu . edu/~eero/steerpyr/. 
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PSNR 


IPSNR vs. £i 


g 


k 


^tree 


^grid 


h 


^tree 


£lgnd 


5 


35.67 


35.98 


36.15 


0.00 ±.0 


0.31 ± .18 


0.48 ± .25 


10 


31.00 


31.60 


31.88 


0.00 ±.0 


0.61 ±.28 


0.88 ±28 


25 


25.68 


26.77 


27.07 


0.00 ±.0 


1.09 ±.32 


1.38 ±26 


50 


22.37 


23.84 


24.06 


0.00 ±.0 


1.47 ±.34 


1.68 ±41 


100 


19.64 


21.49 


21.56 


0.00 ±.0 


1.85 ±.28 


1.92 ±29 



Table 1: PSNR measured for the denoising of 12 standard images when the regularization function 
is the £i-norm, the tree-structured norm £2 t ree, and the structured norm £2 gr id, and improvement in 
PSNR compared to the ^-norm (IPSNR). Best results for each level of noise and each wavelet type 
are in bold. The reported values are averaged over 5 runs with different noise realizations. 

with n tree , meaning that encouraging spatial consistency in wavelet coefficients is more effective 
than using a hierarchical coding. We also note that our approach is relatively fast, despite the high 
dimension of the problem. Solving exactly the proximal problem with £2 gr id for an image with 
p = 512 x 512 = 262 144 pixels (and therefore approximately the same number of groups) takes 
approximately ps 4 — 6 seconds on a single core of a 3.07GHz CPU. 

5.3 CUR-like Matrix Factorization 

In this experiment, we show how our tools can be used to perform the so-called CUR matrix decom- 
position (Mahoney and Drineas, 2009). It consists of a low-rank approximation of a data matrix X 
in M. nxp in the form of a product of three matrices — that is, X ps CUR. The particularity of the CUR 
decomposition lies in the fact that the matrices C 6 R" xc and R G W xp are constrained to be respec- 
tively a subset of c columns and r rows of the original matrix X. The third matrix U G IR cxr is then 
given by C + XR + , where A + denotes a Moore-Penrose generalized inverse of the matrix A (Horn 
and Johnson, 1990). Such a matrix factorization is particularly appealing when the interpretability 
of the results matters (Mahoney and Drineas, 2009). For instance, when studying gene-expression 
datasets, it is easier to gain insight from the selection of actual patients and genes, rather than from 
linear combinations of them. 

In Mahoney and Drineas (2009), CUR decompositions are computed by a sampling procedure 
based on the singular value decomposition of X. In a recent work, Bien et al. (2010) have shown 
that partial CUR decompositions, i.e., the selection of either rows or columns of X, can be obtained 
by solving a convex program with a group-Lasso penalty. We propose to extend this approach to 
the simultaneous selection of both rows and columns of X, with the following convex problem: 

min i||X-XWX||2+?, row V HW'lloo + ^coi £ \\W j\W (12) 
WelR" x "2 fr[ p[ J 

In this formulation, the two sparsity-inducing penalties controlled by the parameters X,- ow and X co \ set 
to zero some entire rows and columns of the solutions of problem (12). Now, let us denote by Wu 
in Rl I l x l J l the submatrix of W reduced to its nonzero rows and columns, respectively indexed by 
I C {1, . . . ,/?} and J C {1, . . . ,«}. We can then readily identify the three components of the CUR 
decomposition of X, namely 

XWX = CWuR » X. 
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Problem (12) has a smooth convex data-fitting term and brings into play a sparsity-inducing norm 
with overlapping groups of variables (the rows and the columns of W). As a result, it is a partic- 
ular instance of problem (2) that can therefore be handled with the optimization tools introduced 
in this paper. We now compare the performance of the sampling procedure from Mahoney and 
Drineas (2009) with our proposed sparsity-based approach. To this end, we consider the four gene- 
expression datasets 9_Tumors, Brain_Tumorsl, Leukemial and SRBCT, with respective dimensions 
(n,p) e {(60,5727), (90,5921), (72,5328), (83, 2309)}. 20 In the sequel, the matrix X is normalized 
to have unit Frobenius-norm while each of its columns is centered. To begin with, we run our ap- 
proach 21 over a grid of values for Ar OW and A, co i in order to obtain solutions with different sparsity 
levels, i.e., ranging from |I| = p and |J| = n down to |I| = |J| = 0. For each pair of values [|I|, |J|], 
we then apply the sampling procedure from Mahoney and Drineas (2009). Finally, the variance 
explained by the CUR decompositions is reported in Figure 3 for both methods. Since the sampling 
approach involves some randomness, we show the average and standard deviation of the results 
based on five initializations. The conclusions we can draw from the experiments match the ones 
already reported in Bien et al. (2010) for the partial CUR decomposition. We can indeed see that 
both schemes perform similarly. However, our approach has the advantage not to be randomized, 
which can be less disconcerting in the practical perspective of analyzing a single run of the algo- 
rithm. It is finally worth being mentioned that the convex approach we develop here is flexible and 
can be extended in different ways. For instance, we can imagine to add further low-rank/sparsity 
constraints on W thanks to sparsity-promoting convex regularizations. 

5.4 Background Subtraction 

Following Cehver et al. (2008); Huang et al. (2009), we consider a background subtraction task. 
Given a sequence of frames from a fixed camera, we try to segment out foreground objects in a new 
image. If we denote by y G W 1 this image composed of n pixels, we model y as a sparse linear 
combination of p other images X £ M" xp , plus an error term e in R", i.e., y « Xw + e for some 
sparse vector w in W p . This approach is reminiscent of Wright et al. (2009a) in the context of face 
recognition, where e is further made sparse to deal with small occlusions. The term Xw accounts 
for background parts present in both y and X, while e contains specific, or foreground, objects in y. 
The resulting optimization problem is given by 

min I|| y -Xw-e||i + A,i||w||i+^{||e||i+ft(e)}, with^i, X 2 >0. (13) 

weR'>,eeR" 2 

In this formulation, the only ^i-norm penalty does not take into account the fact that neighboring 
pixels in y are likely to share the same label (background or foreground), which may lead to scattered 
pieces of foreground and background regions (Figure 4). We therefore put an additional structured 
regularization term £1 on e, where the groups in Q are all the overlapping 3 x 3-squares on the 
image. For the sake of comparison, we also consider the regularization Cl where the groups are 
non-overlapping 3 x 3-squares. 

This optimization problem can be viewed as an instance of problem (2), with the particular 
design matrix [X, I] in R" x defined as the columnwise concatenation of X and the identity 

20. The datasets are freely available at http: / /www . gems-system, org/. 

21. More precisely, since the penalties in problem (12) shrink the coefficients of W, we follow a two-step procedure: We 
first run our approach to determine the sets of nonzero rows and columns, and then compute Wu = C + XR + . 
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Figure 3: Explained variance of the CUR decompositions obtained for our sparsity-based approach 
and the sampling scheme from Mahoney and Drineas (2009). For the latter, we report the average 
and standard deviation of the results based on five initializations. From left to right and top to 
bottom, the curves correspond to the datasets 9_Tumors, Brain_Tumorsl, Leukemial and SRBCT. 



matrix. As a result, we could directly apply the same procedure as the one used in the other ex- 
periments. Instead, we further exploit the specific structure of problem (13): Notice that for a fixed 
vector e, the optimization with respect to w is a standard Lasso problem (with the vector of obser- 
vations y — e), 22 while for w fixed, we simply have a proximal problem associated to the sum of CI 
and the ^i-norm. Alternating between these two simple and computationally inexpensive steps, i.e., 
optimizing with respect to one variable while keeping the other one fixed, is guaranteed to converge 
to a solution of (13). In our simulations, this alternating scheme has led to a significant speed-up 
compared to the general procedure. 

A dataset with hand-segmented images is used to illustrate the effect of CI. 24 For simplicity, 
we use a single regularization parameter, i.e., X\ = hi, chosen to maximize the number of pixels 
matching the ground truth. We consider p = 200 images with n = 57600 pixels (i.e., a resolution 



22. Since successive frames might not change much, the columns of X exhibit strong correlations. Consequently, we use 
the LARS algorithm (Efron et al., 2004) whose complexity is independent of the level of correlation in X. 

23. More precisely, the convergence is guaranteed since the non-smooth part in (13) is separable with respect to w and e 
(Tseng, 2001). The result from Bertsekas (1999) may also be applied here, after reformulating (13) as a smooth 
convex problem under separable conic constraints. 

24. http : //re search. microsoft. com/ en-us/um/people/ jckrumm/wallf lower/test images . htm 
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of 120 x 160, times 3 for the RGB channels). As shown in Figure 4, adding Q. improves the back- 
ground subtraction results for the two tested images, by removing the scattered artifacts due to the 
lack of structural constraints of the ^i-norm, which encodes neither spatial nor color consistency. 
The group sparsity regularization Q. also improves upon the £i-norm but introduces block-artefacts 
corresponding to the non-overlapping group structure. 

5.5 Topographic Dictionary Learning 

Let us consider a set Y = [y 1 , . . . , y"] in R mx " of n signals of dimension m. The problem of dictionary 
learning, originally introduced by Olshausen and Field (1996), is a matrix factorization problem 
which aims at representing these signals as linear combinations of dictionary elements that are the 
columns of a matrix X = [x 1 , . . . ,x p ] in R mxp . More precisely, the dictionary X is learned along 
with a matrix of decomposition coefficients W = [w , . . . ,w w ] in M px '\ so that y 1 Xw 1 for every 
signal y ! . Typically, n is large compared to m and p. In this experiment, we consider for instance a 
database of n = 100000 natural image patches of size m = 12 x 12 pixels, for dictionaries of size 
p = 400. Adapting the dictionary to specific data has proven to be useful in many applications, 
including image restoration (Elad and Aharon, 2006; Mairal et al., 2009), learning image features in 
computer vision (Kavukcuoglu et al., 2009). The resulting optimization problem we are interested 
in can be written 



min 




where C is a convex set of matrices in M mx P whose columns have ^-norms less than or equal to 
one, 25 X is a regularization parameter and Q. is a sparsity-inducing norm. When £2 is the £\ -norm, we 
obtain a classical formulation, which is known to produce dictionary elements that are reminiscent 
of Gabor-like functions, when the columns of Y are whitened natural image patches (Olshausen and 
Field, 1996). 

Another line of research tries to put a structure on decomposition coefficients instead of consid- 
ering them as independent. Jenatton et al. (2010a, 201 1) have for instance embedded dictionary ele- 
ments into a tree, by using a hierarchical norm (Zhao et al., 2009) for Q.. This model encodes a rule 
saying that a dictionary element can be used in the decomposition of a signal only if its ancestors in 
the tree are used as well. In the related context of independent component analysis (ICA), Hyvarinen 
et al. (2001) have arranged independent components (corresponding to dictionary elements) on a 
two-dimensional grid, and have modelled spatial dependencies between them. When learned on 
whitened natural image patches, this model exhibits "Gabor-like" functions which are smoothly or- 
ganized on the grid, which the authors call a topographic map. As shown by Kavukcuoglu et al. 
(2009), such a result can be reproduced with a dictionary learning formulation, using a structured 
norm for £1. Following their formulation, we organize the p dictionary elements on a J~p x ^fp 
grid, and consider p overlapping groups that are 3 x 3 or 4 x 4 spatial neighborhoods on the grid (to 
avoid boundary effects, we assume the grid to be cyclic). We define £2 as a sum of ^-norms over 
these groups, since the ^-norm has proven to be less adapted for this task. Another formulation 
achieving a similar effect was also proposed by Garrigues and Olshausen (2010) in the context of 
sparse coding with a probabilistic model. 



25. Since the quadratic term in Eq. (14) is invariant by multiplying X by a scalar and W by its inverse, constraining the 
norm of X has proven to be necessary in practice to prevent it from being arbitrarily large. 
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(a) Original frame. 



(b) Estimated background with £2. 




'■ ' v ■ V 


^ ii'i =•• * 


(c) ^i,87.1%. 








■ 



(d) ii+Q. (non-overlapping), 96.3%. (e)£i+Sl (overlapping), 98.9%. 




(g) Original frame. 



(h) Estimated background with £2. 



(f) fi, another frame. 



(i) ^i, 90.5%. 




(j) £i +Q (non-overlapping), 92.6%. (k) t\ + O. (overlapping), 93.8%. 



(1) £2, another frame. 



Figure 4: Background subtraction results. For two videos, we present the original image y, the 
estimated background (i.e., Xw) reconstructed by our method, and the foreground (i.e., the sparsity 
pattern of e as a mask on the original image) detected with l\, l\ +H (non-overlapping groups) and 
with i\ + Q. Figures (f) and (1) present another foreground found with Q., on a different image, with 
the same values of X\,X2 as for the previous image. Best seen in color. 
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Figure 5: Topographic dictionaries with 400 elements, learned on a database of 12 x 12 whitened 
natural image patches. Left: with 3x3 cyclic overlapping groups. Right: with 4x4 cyclic overlap- 
ping groups. 



As Kavukcuoglu et al. (2009); Olshausen and Field (1996), we consider a projected stochastic 
gradient descent algorithm for learning X — that is, at iteration t, we randomly draw one signal y' 
from the database Y, compute a sparse code w' = argmin wgRP ^ l|y* — Xw'll! + X£2(w), and up- 
date X as follows: X <— n^[X — p(Xw' — y f )w' T ], where p is a fixed learning rate, and IT^ denotes 
the operator performing orthogonal projections onto the set C. In practice, to further improve the 
performance, we use a mini-batch, drawing 500 signals at eatch iteration instead of one (see Mairal 
et al, 2010a). Our approach mainly differs from Kavukcuoglu et al. (2009) in the way the sparse 
codes w' are obtained. Whereas Kavukcuoglu et al. (2009) uses a subgradient descent algorithm to 
solve them, we use the proximal splitting methods presented in Section 4. The natural image patches 
we use are also preprocessed: They are first centered by removing their mean value (often called DC 
component), and whitened, as often done in the literature (Hyvarinen et al, 2001; Garrigues and Ol- 
shausen, 2010). The parameter X is chosen such that in average ||y ! — Xw'||2 ~ 0.4 1 1 y* 1 1 2 for all new 
patch considered by the algorithm. Examples of obtained results are shown on Figure 5, and exhibit 
similarities with the topographic maps of Hyvarinen et al. (2001). Note that even though Eq. (14) is 
convex with respect to each variable X and W when one fixes the other, it is not jointly convex, and 
one can not guarantee our method to find a global optimum. Despite its intrinsic non-convex nature, 
local minima obtained with various optimization procedures have been shown to be good enough 
for many tasks (Elad and Aharon, 2006; Mairal et al., 2009; Kavukcuoglu et al., 2009). 

5.6 Multi-Task Learning of Hierarchical Structures 

As mentioned in the previous section, Jenatton et al. (2010a) have recently proposed to use a hierar- 
chical structured norm to learn dictionaries of natural image patches. In Jenatton et al. (2010a), the 
dictionary elements are embedded in a predefined tree T, via a particular instance of the structured 
norm £2, which we refer to it as Qtree> and call Q the underlying set of groups. In this case, using 
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the same notation as in Section 5.5, each signal y' admits a sparse decomposition in the form of a 
subtree of dictionary elements. 

Inspired by ideas from multi-task learning (Obozinski et al., 2010), we propose to learn the 
tree structure T by pruning irrelevant parts of a larger initial tree %. We achieve this by using 
an additional regularization term Qj int across the different decompositions, so that subtrees of % 
will simultaneously be removed for all signals y'. With the notation from Section 5.5, the approach 
of Jenatton et al. (2010a) is then extended by the following formulation: 



1 



1 



mm -Y ^||y i -Xw I l + X 1 n tree (w ; ') +X 2 n joint (W), (15) 
XeC,WeM.p yn n r-( L2 J 

where W = [w 1 , . . . , w"] is the matrix of decomposition coefficients in R pxn . The new regularization 
term operates on the rows of W and is defined as £2j j nt (W) = Y* g eg m&x ie{\ n} |w^,|. 26 The overall 
penalty on W, which results from the combination of Q. aee and Cl- }0 mt, is itself an instance of Q. with 
general overlapping groups, as defined in Eq (3). 

To address problem (15), we use the same optimization scheme as Jenatton et al. (2010a), i.e., 
alternating between X and W, fixing one variable while optimizing with respect to the other. The 
task we consider is the denoising of natural image patches, with the same dataset and protocol 
as Jenatton et al. (2010a). We study whether learning the hierarchy of the dictionary elements 
improves the denoising performance, compared to standard sparse coding (i.e., when n ttee is the 
£i-norm and X2 = 0) and the hierarchical dictionary learning of Jenatton et al. (2010a) based on 
predefined trees (i.e., X2 = 0). The dimensions of the training set — 50000 patches of size 8 x 8 for 
dictionaries with up to p = 400 elements — impose to handle extremely large graphs, with \E\ « 
\V\ ~ 4.10 7 . Since problem (15) is too large to be solved exactly sufficiently many times to select the 
regularization parameters (ki,7^) rigorously, we use the following heuristics: we optimize mostly 
with the currently pruned tree held fixed (i.e., X2 = 0), and only prune the tree (i.e., X2 > 0) every 
few steps on a random subset of 10000 patches. We consider the same hierarchies as in Jenatton 
et al. (2010a), involving between 30 and 400 dictionary elements. The regularization parameter 
is selected on the validation set of 25000 patches, for both sparse coding (Flat) and hierarchical 
dictionary learning (Tree). Starting from the tree giving the best performance (in this case the 
largest one, see Figure 6), we solve problem (15) following our heuristics, for increasing values 
of %2- As shown in Figure 6, there is a regime where our approach performs significantly better than 
the two other compared methods. The standard deviation of the noise is 0.2 (the pixels have values 
in [0, 1]); no significant improvements were observed for lower levels of noise. Our experiments 
use the algorithm of Beck and Teboulle (2009) based on our proximal operator, with weights r| g set 
to 1. We present this algorithm in more details in Appendix C. 



6. Conclusion 

We have presented new optimization methods for solving sparse structured problems involving sums 
of £2- or ^-norms of any (overlapping) groups of variables. Interestingly, this sheds new light on 
connections between sparse methods and the literature of network flow optimization. In particular, 
the proximal operator for the sum of ^-norms can be cast as a specific form of quadratic min-cost 
flow problem, for which we proposed an efficient and simple algorithm. 

26. The simplified case where Q. tlee and fljoint are the l\- and mixed £i/i?2-norms (Yuan and Lin, 2006) corresponds 
to Sprechmann et al. (2010). 
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Figure 6: Left: Hierarchy obtained by pruning a larger tree of 76 elements. Right: Mean square 
error versus dictionary size. The error bars represent two standard deviations, based on three runs. 



In addition to making it possible to resort to accelerated gradient methods, an efficient compu- 
tation of the proximal operator offers more generally a certain modularity, in that it can be used as a 
building-block for other optimization problems. A case in point is dictionary learning where prox- 
imal problems come up and have to be solved repeatedly in an inner-loop. Interesting future work 
includes the computation of other structured norms such as the one introduced in Jacob et al. (2009), 
or total-variation based penalties, whose proximal operators are also based on minimum cost flow 
problems (Chambolle and Darbon, 2009). Several experiments demonstrate that our algorithm can 
be applied to a wide class of learning problems, which have not been addressed before with convex 
sparse methods. 
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Appendix A. Equivalence to Canonical Graphs 

Formally, the notion of equivalence between graphs can be summarized by the following lemma: 
Lemma 2 (Equivalence to canonical graphs.) 

Let G = (V,E,s, t) be the canonical graph corresponding to a group structure Q. Let G' = (V,E' ,s,t) 
be a graph sharing the same set of vertices, source and sink as G, but with a different arc set E'. We 
say that G' is equivalent to G if and only if the following conditions hold: 

• Arcs ofE' outgoing from the source are the same as in E, with the same costs and capacities. 
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• Arcs ofE' going to the sink are the same as in E, with the same costs and capacities. 

• For every arc (g, j) in E, with (g, j) in V gr x V u , there exists a unique path in E' from g to j 
with zero costs and infinite capacities on every arc of the path. 

• Conversely, if there exists a path in E' between a vertex g in V gr and a vertex j in V u , then 
there exists an arc (g, j) in E. 

Then, the cost of the optimal min-cost flow on G and G' are the same. Moreover, the values of the 
optimal flow on the arcs (j,t), j in V u , are the same on G and G'. 

Proof. We first notice that on both G and G', the cost of a flow on the graph only depends on the 
flow on the arcs (j,t), j in V u , which we have denoted by \ in E. 

We will prove that finding a feasible flow 71 on G with a cost c(%) is equivalent to finding a 
feasible flow ji' on G' with the same cost c(tc) = c(n'). We now use the concept of path flow, which 
is a flow vector in G carrying the same positive value on every arc of a directed path between two 
nodes of G. It intuitively corresponds to sending a positive amount of flow along a path of the graph. 

According to the definition of graph equivalence introduced in the Lemma, it is easy to show 
that there is a bijection between the arcs in E, and the paths in E' with positive capacities on every 
arc. Given now a feasible flow 71 in G, we build a feasible flow %' on G' which is a sum of path 
flows. More precisely, for every arc a in E, we consider its equivalent path in E', with a path flow 
carrying the same amount of flow as a. Therefore, each arc a' in E' has a total amount of flow that 
is equal to the sum of the flows carried by the path flows going over a'. It is also easy to show that 
this construction builds a flow on G' (capacity and conservation constraints are satisfied) and that 
this flow 7t' has the same cost as 7t, that is, c{%) = c(tc'). 

Conversely, given a flow n' on G', we use a classical path flow decomposition (see Bertsekas, 
1998, Proposition 1.1), saying that there exists a decomposition of ji' as a sum of path flows in E'. 
Using the bijection described above, we know that each path in the previous sums corresponds to a 
unique arc in E. We now build a flow % in G, by associating to each path flow in the decomposition 
of 7t', an arc in E carrying the same amount of flow. The flow of every other arc in E is set to zero. 
It is also easy to show that this builds a valid flow in G that has the same cost as ji'. ■ 



Appendix B. Convergence Analysis 

We show in this section the correctness of Algorithm 1 for computing the proximal operator, and of 
Algorithm 2 for computing the dual norm £1*. 

B.l Computation of the Proximal Operator 

We first prove that our algorithm converges and that it finds the optimal solution of the proximal 
problem. This requires that we introduce the optimality conditions for problem (6) derived from Je- 
natton et al. (2010a, 2011) since our convergence proof essentially checks that these conditions are 
satisfied upon termination of the algorithm. 

Lemma 3 (Optimality conditions of the problem (6) from Jenatton et al. 2010a, 2011) 

The primal-dual variables (w,^) are respectively solutions of the primal (4) and dual problems (6) 
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if and only if the dual variable £, is feasible for the problem (6) and 
w = u-£ 



wj^= ||wj„.||$*||i and ||^ g ||i = fa\ g , 
or w P = 0. 



Note that these optimality conditions provide an intuitive view of our min-cost flow problem. 
Solving the min-cost flow problem is equivalent to sending the maximum amount of flow in the 
graph under the capacity constraints, while respecting the rule that the flow coming from a group g 
should always be directed to the variables Uj with maximum residual u, — Lge£ ^fj- This point can 
be more formaly seen by noticing that one of the optimality conditions above corresponds to the 
case of equality in the £\/£co Holder inequality 

Before proving the convergence and correctness of our algorithm, we also recall classical prop- 
erties of the min capacity cuts, which we intensively use in the proofs of this paper. The procedure 
computeFlow of our algorithm finds a minimum (s,t)-cut of a graph G = (V,E,s,t), dividing the 
set V into two disjoint parts V + and V~. V + is by construction the sets of nodes in V such that there 
exists a non-saturating path from s to V, while all the paths from s to V~ are saturated. Conversely, 
arcs from V + to t are all saturated, whereas there can be non-saturated arcs from V~ to t. Moreover, 
the following properties, which are illustrated on Figure 7 , hold 

• There is no arc going from V + to V~. Otherwise the value of the cut would be infinite (arcs 
inside V have infinite capacity by construction of our graph). 

• There is no flow going from V~ to V + (see Bertsekas, 1998). 

• The cut goes through all arcs going from V + to t, and all arcs going from s to V - . 




Figure 7: Cut computed by our algorithm. V + = V+ U V+, with V+ = {g}, V+ = {1,2}, and V ~ = 
V~ U V gr , with V gr = {h}, V~ = {3}. Arcs going from s to V~ are saturated, as well as arcs going 
from V + to t. Saturated arcs are in bold. Arcs with zero flow are dotted. 



28 



Convex and Network Flow Optimization for Structured Sparsity 



Recall that we assume (cf. Section 3.3) that the scalars u 7 are all non negative, and that we add 
non-negativity constraints on With the optimality conditions of Lemma 3 in hand, we can show 
our first convergence result. 

Proposition 1 (Convergence of Algorithm 1) 

Algorithm 1 converges in a finite and polynomial number of operations. 

Proof. Our algorithm splits recursively the graph into disjoints parts and processes each part recur- 
sively. The processing of one part requires an orthogonal projection onto an £i -ball and a max-flow 
algorithm, which can both be computed in polynomial time. To prove that the procedure converges, 
it is sufficient to show that when the procedure computeFlow is called for a graph (V,E,s,t) and 
computes a cut (V + ,V~), then the components V + and V~ are both non-empty. 

Suppose for instance that V~= 0. In this case, the capacity of the min-cut is equal to Y,jev u lj> 
and the value of the max-flow is Eyev„^>- Using the classical max-flow/min-cut theorem (Ford and 
Fulkerson, 1956), we have equality between these two terms. Since, by definition of both y and ^, 
we have for all j in V u , ^ • < Jj, we obtain a contradiction with the existence of j in V u such that 

Conversely, suppose now that V + = 0. Then, the value of the max-flow is still Y<j€V u ^j> an ^ 
the value of the min-cut is ^L ge y gl T|g- Using again the max-flow/min-cut theorem, we have that 
HjevAj = ^L g ev gr T\g- Moreover, by definition of y, we also have Zjev u lj < Hjevjj < ^Lgev^g, 
leading to a contradiction with the existence of j in V u satisfying ^ • / y-. We remind the reader of 
the fact that such a j € V u exists since the cut is only computed when the current estimate \ is not 
optimal yet. This proof holds for any graph that is equivalent to the canonical one. ■ 

After proving the convergence, we prove that the algorithm is correct with the next proposition. 

Proposition 2 (Correctness of Algorithm 1) 

Algorithm 1 solves the proximal problem ofEq. (4). 

Proof. For a group structure Q, we first prove the correctness of our algorithm if the graph used 
is its associated canonical graph that we denote Go = (Vo,Eo,s,t). We proceed by induction on the 
number of nodes of the graph. The induction hypothesis Of(k) is the following: 

For all canonical graphs G = (V = V u UV gr ,E,s,t) associated with a group structure Qy with 
weights (r\g) g eg v such that \V\ < k, computeFlow(V,E) solves the following optimization prob- 
lem: 

min £ Uuj - ^ s-t Vg G V gr , £ ^ < Xt\ 8 and % = 0, Vj $ g. (16) 

&j)i&ux&F jev u z gev gr jev u 

Since ^y = Q, it is sufficient to show that i#"(|Vb|) to prove the proposition. 

We initialize the induction by 9f{2), corresponding to the simplest canonical graph, for which 
\Vgr\ = \V U \ = 1)- Simple algebra shows that 9({2) is indeed correct. 

We now suppose that 9{{k') is true for all k' < k and consider a graph G = (V,E,s,t), \V\ = k. 
The first step of the algorithm computes the variable (jj)jev u by a projection on the £ i-ball. This is 
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itself an instance of the dual formulation of Eq. (6) in a simple case, with one group containing all 
variables. We can therefore use Lemma 3 to characterize the optimality of (jj)jev u , which yields 

ZjevMl = ( max ;W„ Y/l) Ey'ev B 1j and Zjevjj = ^L g ev gr ^g, n ~ 

or u 7 -Y y = 0, Vj€V u . K ' 

The algorithm then computes a max-flow, using the scalars y- as capacities, and we now have two 
possible situations: 

1. If ^ • = y - for all j in V M , the algorithm stops; we write Wj = Uj — ^ • for j in V M , and using 
Eq. (17), we obtain 

Zjev u w£j = (max ie y„ \\Vj\)ZjeV u lj and EjevJ; = ^LgeV^g, (m 
or w y =0, V/€ V„. 

We can rewrite the condition above as 

Since all the quantities in the previous sum are positive, this can only hold if for all g G V^ r , 

£w$ = (max|w y |)£^. 

Moreover, by definition of the max flow and the optimality conditions, we have 
V S G Vgr, £ %*; < fo] g , and £ ^. = X £ T| g , 

;'GV„ ;GV„ §GV sr 

which leads to 

;ev„ 

By Lemma 3, we have shown that the problem (16) is solved. 

2. Let us now consider the case where there exists j in V u such that £ • / y.. The algorithm splits 
the vertex set V into two parts V + and V~, which we have proven to be non-empty in the 
proof of Proposition 1. The next step of the algorithm removes all edges between V + and V~ 
(see Figure 7). Processing (V + ,E + ) and (V~ ,E~) independently, it updates the value of the 
flow matrix j G V„, g G V gr , and the corresponding flow vector ^, j G V M . As for V, we 
denote by V+ = V+D V„, V~ =V~H V u and V+ = V+D V gr , V~ =V~H V gr . 

Then, we notice that (V + ,E + ,s,t) and (V~ ,E~ ,s,t) are respective canonical graphs for the 
group structures g v+ = {gC\V+ \ g G V gr }, and Qy- — {gd V~ \ g G V gr }. 

Writing wy = u ; - — ^- for y'inV^and using the induction hypotheses and ^(jV^I), 

we now have the following optimality conditions deriving from Lemma 3 applied on Eq. (16) 
respectively for the graphs (V + ,E + ) and (V~ ,E~): 

v 8 s v g U'±gnv+, { ^^ A °° Ljeg ^ and Zje *% = H? ' d9) 
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and 



v „ ey - „'A„ ny - { ^ = \\M\^A) and Zjefi) = M s , 



(20) 



We will now combine Eq. (19) and Eq. (20) into optimality conditions for Eq. (16). We first 
notice that V„ + = g since there are no arcs between V + and V~ in E (see the properties 
of the cuts discussed before this proposition). It is therefore possible to replace g f by g in 
Eq. (19). We will show that it is possible to do the same in Eq. (20), so that combining these 
two equations yield the optimality conditions of Eq. (16). 

More precisely, we will show that for all g G V~ and j G gnV M + , |wy| < ma.x legnv - |w/|, in 
which case g' can be replaced by g in Eq. (20). This result is relatively intuitive: (s, V + ) and 
{V~,t) being an (s,t)-cut, all arcs between s and V~ are saturated, while there are unsaturated 
arcs between s and V + ; one therefore expects the residuals u ; — to decrease on the V + side, 
while increasing on the V~ side. The proof is nonetheless a bit technical. 

Let us show first that for all g in Hw^H^ < maxy G y [( |uy — y -|. We split the set V + into 
disjoint parts: 

V g + r + = {g€V+ s.t. ||w,L<max|uy-Yy|}, 

;ev„ 

V u ++ = {j€V u + s.t. 3 5 GV++, yGg}, 
^ _ =V£\V+ + = {*€V+ s.t. ||wX>max|uy-Yy|}, 

v+-±v+\v+ + . 

II Li \ It 

As previously, we denote V + ~ = V M + ~U V g + r ~ and V ++ =V H ++ U V g + r + . We want to show that 
V*~~ is necessarily empty. We reason by contradiction and assume that V gr ~ / 0. 

According to the definition of the different sets above, we observe that no arcs are going from 
V ++ to V + ~, that is, for all g in V g + r + , g(~) = 0- We observe as well that the flow from 
V g + r ~ to V M ++ is the null flow, because optimality conditions (19) imply that for a group g only 
nodes j G g such that wy = ||wg||oo receive some flow, which excludes nodes in V M ++ provided 
V gr ~ / 0; Combining this fact and the inequality Y, geV +hr\ g > £y g y+Yy (which is a direct 
consequence of the minimum (s^-cut), we have as well 

I ^g> E Yy 

g€V g + r - jev u + - 

Let j G V M + ~, if £y / then for some g G V g +~ such that y receives some flow from g, which 
from the optimality conditions (19) implies wy = ||w g ||oo; by definition of V+T , ||wg||oo > 
Uy - Yy- But since at the optimum, wy = Uy — £ •, this implies that £ • < Yy, and in turn that 

*< E ^= E Sy< E Yy 

and this is a contradiction. 
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We now have that for all g in , 1 1 w g \ | ^ < ma.xj e v u I u/ — Y/ I • The proof showing that for all g 
in V gr , llwgll^ > maxj e v u |u/ — Jj\, uses the same kind of decomposition for V~, and follows 
along similar arguments. We will therefore not detail it. 

To summarize, we have shown that for all g G V~ and j G g n V H + , |wy| < ma.x legnv - |w/|. 
Since there is no flow from V~ to V + , i.e., = for g in and j in V M + , we can now 
replace the definition of g' in Eq. (20) by g' = gC\V u , the combination of Eq. (19) and Eq. (20) 
gives us optimality conditions for Eq. (16). 

The proposition being proved for the canonical graph, we extend it now for an equivalent graph 
in the sense of Lemma 2. First, we observe that the algorithm gives the same values of y for two 
equivalent graphs. Then, it is easy to see that the value £, given by the max-flow, and the chosen 
(s,?)-cut is the same, which is enough to conclude that the algorithm performs the same steps for 
two equivalent graphs. ■ 



B.2 Computation of the Dual Norm £2* 

As for the proximal operator, the computation of dual norm Q.* can itself be shown to solve another 
network flow problem, based on the following variational formulation, which extends a previous 
result from Jenatton et al. (2009): 

Lemma 4 (Dual formulation of the dual-norm Q.*.) 

Let K G R p . We have 

Q*(k)= min x s.t. V t, 8 = K, and Vg G Q, \\^ 8 \\i < vr\ g with ^ 8 j = 0ifj^g. 

Proof. By definition of Q.* (k), we have 

Q*(k) = max z T K. 

£2(z)<l 

By introducing the primal variables (a g ) ge g G Rl^l, we can rewrite the previous maximization 
problem as 

Q*(k) = max k t z, s.t. V g G Q, ||zJoo < oc„, 

with the additional \ Q\ conic constraints ||zg||oo < OCg. This primal problem is convex and satisfies 
Slater's conditions for generalized conic inequalities, which implies that strong duality holds (Boyd 
and Vandenberghe, 2004). We now consider the Lagrangian L defined as 

£(z,a„T,y^)=K T z + T(l-£r|,a,)+£ ("A (f\ 

with the dual variables {x, {y g ) ge g^} G K + xl^l xM^^I such that for all g G Q, ^ = if j £ g 
and ||^ ||i < J g . The dual function is obtained by taking the derivatives of L with respect to the 
primal variables z and (a g ) gf zg and equating them to zero, which leads to 

V/€{l,...,p}, Kj+Lgzg*!} =0 

Vgeg, xri g -Y g =0. 
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After simplifying the Lagrangian and flipping the sign of the dual problem then reduces to 

min t s t h e 0. - >Ph*J =I*eg¥j and = if a g, 
5€w x iffi,x€R \v#e £,||^||i <xri ? , 

which is the desired result. ■ 

We now prove that Algorithm 2 is correct. 

Proposition 3 (Convergence and correctness of Algorithm 2) 

Algorithm 2 computes the value of the dual norm of Eq. (9) in a finite and polynomial number of 
operations. 

Proof. The convergence of the algorithm only requires to show that the cardinality of V in the 
different calls of the function computeFlow strictly decreases. Similar arguments to those used in 
the proof of Proposition 1 can show that each part of the cuts (V + ,V~) are both non-empty. The 
algorithm thus requires a finite number of calls to a max-flow algorithm and converges in a finite 
and polynomial number of operations. 

Let us now prove that the algorithm is correct for a canonical graph. We proceed again by 
induction on the number of nodes of the graph. More precisely, we consider the induction hypothesis 
H'{k) defined as: 

for all canonical graphs G = (V,E,s,t) associated with a group structure Qv and such that |V| <k, 
dualNormAux(V = V u UV gr ,E) solves the following optimization problem: 

minx s.t. Vj G V u ,Kj = £ £*, and Vg G V gr , £ < zr\ g with = if j $ g. (21) 

We first initialize the induction by Of{T) (i.e., with the simplest canonical graph, such that |V gr | = 
\V U \ = 1). Simple algebra shows that Hil) is indeed correct. 

We next consider a canonical graph G = (V,E,s,t) such that \V \ =k, and suppose that 9f'{k— 1) 
is true. After the max-flow step, we have two possible cases to discuss: 

1. If £,j = y . for all j in V u , the algorithm stops. We know that any scalar x such that the con- 
straints of Eq. (21) are all satisfied necessarily verifies L§GV gr xr lg ^ Hjev u K j- We have indeed 
that Y<gev gr xr lg is the value of an (s, t)-cut in the graph, and Y*jev u K j is tn e value of the max- 
flow, and the inequality follows from the max-flow/min-cut theorem (Ford and Fulkerson, 
1956). This gives a lower-bound on x. Since this bound is reached, x is necessarily optimal. 

2. We now consider the case where there exists j in V u such that £ • / Kj, meaning that for 
the given value of X, the constraint set of Eq. (21) is not feasible for ^, and that the value of X 
should necessarily increase. The algorithm splits the vertex set V into two non-empty parts V + 
and V~ and we remark that there are no arcs going from V + to V~, and no flow going from V~ 
to V + . Since the arcs going from s to V~ are saturated, we have that Y, g ev~ r %r \g — ^j€V„ K J- 
Let us now consider x* the solution of Eq. (21). Using the induction hypothesis 9{'(\V ~ |), the 
algorithm computes a new value x' that solves Eq. (21) when replacing V by V~ and this new 
value satisfies the following inequality Lgev g 7 t'^g > E/ey- K ;'- The value of x' has therefore 
increased and the updated flow \ now satisfies the constraints of Eq. (21) and therefore x' > X*. 
Since there are no arcs going from V + to V~, X* is feasible for Eq. (21) when replacing V by 
V~ and we have that x* > x' and then x' = X*. 



33 



Mairal, Jenatton, Obozinski and Bach 



To prove that the result holds for any equivalent graph, similar arguments to those used in the proof 
of Proposition 1 can be exploited, showing that the algorithm computes the same values of x and 
same (j,?)-cuts at each step. ■ 



Appendix C. Algorithm FISTA with duality gap 

In this section, we describe in details the algorithm FISTA (Beck and Teboulle, 2009) when applied 
to solve problem (2), with a duality gap as the stopping criterion. The algorithm, as implemented in 
the experiments, is summarized in Algorithm 3. 

Without loss of generality, let us assume we are looking for models of the form Xw, for some 
matrix X G W xp (typically, a linear model where X is the design matrix composed of n observations 
in W). Thus, we can consider the following primal problem 

min/(Xw)+A£2(w), (22) 

wgRp 

in place of problem (2). Based on Fenchel duality arguments (Borwein and Lewis, 2006), 

/(Xw) +XQ(w) +/*(-k), for weR p ,K£ W and IT (X t k) < X, 

is a duality gap for problem (22), where /*(k) = sup z [z T K — /(z)] is the Fenchel conjugate of 
/ (Borwein and Lewis, 2006). Given a primal variable w, a good dual candidate K can be obtained 
by looking at the conditions that have to be satisfied by the pair (w,k) at optimality (Borwein and 
Lewis, 2006). In particular, the dual variable K is chosen to be 

K = -p" 1 V/(Xw), with p = max{^ 1 Q*(X T V/(Xw)),l}. 

Consequently, computing the duality gap requires evaluating the dual norm £2*, as explained in 
Algorithm 2. We sum up the computation of the duality gap in Algorithm 3. Moreover, we refer to 
the proximal operator associated with XQ. as prox^. 27 

In our experiment, we choose the line-search parameter v to be equal to 1.5. 

Appendix D. Speed comparison of Algorithm 1 with parametric max-flow algorithms 

As shown by Hochbaum and Hong (1995), min-cost flow problems, and in particular, the dual 
problem of (4), can be reduced to a specific parametric max-flow problem. We thus compare our 
approach (ProxFlow) with the efficient parametric max-flow algorithm proposed by Gallo et al. 
(1989) and a simplified version of the latter proposed by Babenko and Goldberg (2006). We refer 
to these two algorithms as GGT and SIMP respectively. The benchmark is established on the same 
datasets as those already used in the experimental section of the paper, namely: (1) three datasets 
built from overcomplete bases of discrete cosine transforms (DCT), with respectively 10 4 , 10 5 and 
10 6 variables, and (2) images used for the background subtraction task, composed of 57600 pixels. 
For GGT and SIMP, we use the paraF software which is a C++ parametric max-flow implementa- 
tion available at http : / /www. avglab . com/ andrew/ soft .html. Experiments were conducted on 

27. As a brief reminder, it is defined as the function that maps the vector u in W to the (unique, by strong convexity) 
solution of Eq. (4). 
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Algorithm 3 FISTA procedure to solve problem (22). 

1: Inputs: initial G W, Q.,X>0, £ gap > (precision for the duality gap). 

2: Parameters: v > 1, Lq > 0. 

3: Outputs: solution w. 

4: Initialization: y^) = W(o), ti = l,k=l. 

5: while { computeDualityGap(w( y t_ 1 )) > £ gap } do 

6: Find the smallest integer s k > such that 

7: /(prox [W1] (y w )) </(y (yt) )+Aj ) V/(y w ) + §||A w ||2, 

8: with L = L k v Sk and A (t) = y w -prox [W1] (y {k) ). 

9: L k ^L k ^V s ". 

10: w w <- prox [X n] (y w ). 

11: t k+1 <r- (l + ^/l+?2)/2. 

12: y(i+l)<-W (t) + ^(w {t) -W N) ). 

13: &<K&+1. 

14: end while 

15: Return: w ^— W( fc _x). 

Procedure computeDualityGap(w) 

1: K^-p-^^Xw), with p = max {^ 1 ^*(X T V/(Xw)), l}. 

2: Return: /(Xw) +X^(w) +/*(-k). 



a single-core 2.33 Ghz. We report in the following table the average execution time in seconds of 
each algorithm for 5 runs, as well as the statistics of the corresponding problems: 



Number of variables p 


10000 


100000 


1000000 


57600 


\y\ 


20000 


200000 


2000000 


57600 


\E\ 


110000 


500000 


11000000 


579632 


ProxFlow (in sec.) 


0.4 


3.1 


113.0 


1.7 


GGT (in sec.) 


2.4 


26.0 


525.0 


16.7 


SIMP (in sec.) 


1.2 


13.1 


284.0 


8.31 



Although we provide the speed comparison for a single value of X (the one used in the corresponding 
experiments of the paper), we observed that our approach consistently outperforms GGT and SIMP 
for values of X corresponding to different regularization regimes. 
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